In Project 1, we examined the AirBnB dataset and conducted basic exploratory data analysis (EDA) with statistical inference to establish baseline relationships between variables. While the t-tests, chi-squared tests, and ANOVA tests yielded interesting results, we wanted to dive deeper into the stories behind the numbers. For that, we turn to regressions to estimate how variables affect one another, and we introduce a dataset on crime in order to better extrapolate results to the broader population of AirBnBs. This paper first provides a high-level EDA of the AirBnB and crime datasets. It then applies a variety of regression techniques. We begin with AirBnB variables only, examining which factors affect listing prices. Then, we overlay the crime data onto the AirBnB data to investigate the effects of crime on prices. The penultimate section compares the regression results against one another in order to determine the best model(s). Lastly, the paper concludes with a summary of findings and areas for further research.
To begin, we present various summary statistics for the two datasets (AirBnB and crime) that we are investigating. Below are the structure printouts for both datasets, beginning with AirBnB, followed by crime:
In the AirBnB (“listings”) dataset, there are 9,126 observations and 16 variables. The dataset is relatively comprehensive: it includes qualitative variables such as unique ID, name, and neighbourhood, as well as quantitative variables such as price and number of reviews. Most importantly, the listings dataset contains latitude and longitude coordinates that will serve as the link to the crime dataset.
The crime dataset, by comparison, contains 29,045 observations and 25 variables. It records nine types of offenses as well as the method of crime (gun, knife, others) and the time of day (day, evening, midnight). Like the listings dataset, the crime dataset is labeled by latitude and longitude coordinates, as well as by census tract, and these will be important variables for joining the two datasets together.
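One way the coordinate-based link between the two datasets could work is to count, for each listing, the crimes that occurred within a fixed distance of it. The following is a minimal sketch of that idea, not the method used in this paper: the function names, the 0.5 km radius, and the toy coordinates are all assumptions made for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def crimes_within_radius(listing_lat, listing_lon, crime_lat, crime_lon, radius_km=0.5):
    """Count crimes within radius_km of each listing (hypothetical helper)."""
    # Reshape so that every listing (rows) is compared with every crime (columns).
    l_lat = np.asarray(listing_lat, dtype=float)[:, None]
    l_lon = np.asarray(listing_lon, dtype=float)[:, None]
    c_lat = np.asarray(crime_lat, dtype=float)[None, :]
    c_lon = np.asarray(crime_lon, dtype=float)[None, :]
    dist = haversine_km(l_lat, l_lon, c_lat, c_lon)
    return (dist <= radius_km).sum(axis=1)


# Toy coordinates for illustration only; these are not rows from either dataset.
listing_lat = [38.90, 38.95]
listing_lon = [-77.03, -77.00]
crime_lat = [38.9005, 38.9010, 38.95]
crime_lon = [-77.0305, -77.0300, -77.10]

counts = crimes_within_radius(listing_lat, listing_lon, crime_lat, crime_lon)
```

At full scale (9,126 listings by 29,045 crimes) the all-pairs distance matrix is large, so a chunked loop or a spatial index would likely be preferable. Joining on ward or census tract instead would avoid the distance computation entirely.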
An important step in EDA is exploring the data through summary statistics. Since we have already examined the listings dataset thoroughly in Project 1, the following section will primarily focus on the crime dataset. Because we are interested in how crime levels affect AirBnB prices, it is important to take a closer look at the different types of crime. Below is a summary table of the number of crimes by offense and ward:
| WARD | ARSON | ASSAULT W/DANGEROUS WEAPON | BURGLARY | HOMICIDE | MOTOR VEHICLE THEFT | ROBBERY | SEX ABUSE | THEFT F/AUTO | THEFT/OTHER | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 151 | 117 | 15 | 221 | 345 | 17 | 1443 | 1806 | 4116 |
| 2 | 1 | 119 | 168 | 0 | 225 | 232 | 26 | 1898 | 3467 | 6136 |
| 3 | 0 | 21 | 72 | 3 | 89 | 35 | 10 | 566 | 898 | 1694 |
| 4 | 0 | 60 | 110 | 4 | 176 | 136 | 13 | 857 | 863 | 2219 |
| 5 | 1 | 241 | 208 | 17 | 347 | 312 | 22 | 1441 | 1729 | 4318 |
| 6 | 2 | 150 | 119 | 16 | 248 | 309 | 23 | 1625 | 2411 | 4903 |
| 7 | 0 | 324 | 137 | 38 | 421 | 317 | 30 | 756 | 1191 | 3214 |
| 8 | 3 | 320 | 175 | 54 | 249 | 257 | 37 | 521 | 829 | 2445 |
Ward 2 has the most crimes (6,136), followed by Ward 6 with 4,903. Moreover, “Theft/Other” is the most common crime type in every ward, as seen in the table above.
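A ward-by-offense table of this kind can be produced with a cross-tabulation. The sketch below uses pandas on a handful of made-up rows; the column names `WARD` and `OFFENSE` are assumed here and may not match the actual crime file.

```python
import pandas as pd

# Toy incident records for illustration only; not rows from the crime dataset.
crimes = pd.DataFrame(
    {
        "WARD": [1, 1, 1, 2, 2, 2, 3],
        "OFFENSE": [
            "THEFT/OTHER", "THEFT/OTHER", "ROBBERY",
            "THEFT/OTHER", "BURGLARY", "THEFT/OTHER",
            "ROBBERY",
        ],
    }
)

# Count incidents per (ward, offense) and add row and column totals.
table = pd.crosstab(
    crimes["WARD"], crimes["OFFENSE"], margins=True, margins_name="TOTAL"
)

# Most common offense in each ward, ignoring the TOTAL row and column.
counts_only = table.drop(index="TOTAL", columns="TOTAL")
most_common = counts_only.idxmax(axis=1)
```

Here `table` has one row per ward plus a `TOTAL` row, and `most_common` gives the offense with the highest count in each ward.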
Data visualization is a useful tool for understanding the crime data and its underlying relationships. This section presents two charts: a bar chart and a pie chart. The bar chart is presented below:
Additionally, the same data can be visualized as a pie chart, which shows each ward's share of the total number of crimes:
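The ward shares behind such a pie chart can be recomputed directly from the TOTAL column of the summary table. The sketch below does this with pandas; the variable names are chosen here for illustration.

```python
import pandas as pd

# TOTAL column of the crimes-by-ward summary table, indexed by ward number.
ward_totals = pd.Series(
    [4116, 6136, 1694, 2219, 4318, 4903, 3214, 2445],
    index=pd.Index(range(1, 9), name="WARD"),
    name="TOTAL",
)

# Each ward's percentage of all recorded crimes.
shares = 100 * ward_totals / ward_totals.sum()

# Wards ordered from largest to smallest share.
ranked = shares.sort_values(ascending=False)
```

The totals sum to 29,045, which matches the number of observations in the crime dataset, and Ward 2 accounts for roughly 21% of all recorded crimes.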
These summary statistics and charts are particularly important in the context of AirBnB listings. It is reasonable to hypothesize that wards with a higher number of crimes overall may also have lower listing prices. If fewer people want to stay in those areas, demand for AirBnBs decreases and, in turn, so do prices. Further analysis using regression techniques will be needed to determine the overall effect of crime on prices.
After conducting EDA and looking at the variables at a high level, we move on to building regression models to estimate the relationships between the two datasets. In this section, we examine five models using the techniques learned in class, beginning with a simple linear regression model within the AirBnB dataset alone. We then fit a logistic regression on the same dataset. In the third model, we overlay the crime dataset onto the listings dataset in order to explore how crime affects AirBnB listing prices. The penultimate model is a hedonic regression used to predict price. Lastly, we apply machine learning methods to break down the primary drivers of price.
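As a rough illustration of the hedonic form mentioned above, the sketch below fits a log-price regression by ordinary least squares with NumPy. It is not one of the five models in this paper. The data are synthetic and generated from known coefficients, and the feature names `crime_count` and `entire_home` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic features (not drawn from the real datasets).
crime_count = rng.integers(0, 51, size=n).astype(float)  # hypothetical crimes near a listing
entire_home = rng.integers(0, 2, size=n).astype(float)   # hypothetical room-type dummy

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), crime_count, entire_home])

# Synthetic outcome built from known coefficients with no noise,
# so the fit should recover them up to floating-point error.
true_beta = np.array([5.0, -0.02, 0.3])
log_price = X @ true_beta

# Ordinary least squares: minimize ||X b - log_price||^2.
beta_hat, *_ = np.linalg.lstsq(X, log_price, rcond=None)
```

In a log-price specification, a coefficient multiplied by 100 is approximately the percentage change in price per one-unit change in the feature. Under these made-up coefficients, each additional nearby crime corresponds to a price about 2% lower.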
[POLLY’S MODEL]
[ELISE’S MODEL]
[JEFFREY’S MODEL]
[PANCEA’S MODEL]
[MATT’S MODEL]